
[AMD] fixed the ReorderInstructions pass #5254

Open · wants to merge 1 commit into base: main

Conversation

@ravil-mobile (Contributor) commented on Nov 25, 2024:

  • fixed local store and global load ordering for GEMM kernels

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these rules.

  • I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because the current tests are supposed to cover the code changes.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@sjw36 (Contributor) left a comment:


See comments below

```cpp
  if (isPureMatmulProblem(funcOp)) {
    scheduleGlobalLoadLocalStore(funcOp);
    sinkSecondLoad(funcOp);
    const bool independentGlobalLoadStages =
```
Contributor (sjw36):

This should be independent of whether pipelining has happened or not. It previously applied to all loads in the function, not just those in the for loop.

```cpp
  // Best perf on GEMM when these precede global loads.
  funcOp.walk([&](ttg::LocalStoreOp op) { moveOps.push_back(op); });

  if (independentGlobalLoadStages) {
```
Contributor (sjw36):

This is already accounted for in the test for local_store and leadsToLoad on line 260.

Therefore, the order here can universally be:

  • local_stores
  • global_loads

Then reversed (line 233), so the local_stores are moved last and land at the top of the loop if they are independent of the global loads (see the sketch below).
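For illustration, a minimal sketch of that collect-then-reverse scheme, assuming the pass gathers candidates into a `moveOps` vector and hoists each one to an early insertion point. The hoisting call is only indicative of the file under review (`findEarlyInsertionPoint` appears in the diff; the exact move call and the dependency check are elided here):

```cpp
// Sketch: collect local stores first, then global loads.
SmallVector<Operation *, 8> moveOps;
funcOp.walk([&](ttg::LocalStoreOp op) { moveOps.push_back(op); });
funcOp.walk([&](triton::LoadOp op) { moveOps.push_back(op); });

// Process in reverse: global loads are hoisted first and local stores
// last, so an independent local store ends up above the global loads
// at the top of its block.
for (Operation *op : llvm::reverse(moveOps)) {
  Block *block = op->getBlock();
  // A local_store that leads to a load (the leadsToLoad check) must be
  // skipped here; that dependency test is elided in this sketch.
  op->moveBefore(block, findEarlyInsertionPoint(block, op));
}
```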

Comment on lines 216 to 220:

```cpp
  forOp.walk([&](ttg::LocalStoreOp op) { moveOps.push_back(op); });

  // Move global loads early to prefetch. This may increase register pressure
  // but it enables issuing global loads early.
  forOp.walk([&](triton::LoadOp op) { moveOps.push_back(op); });
```
Contributor Author (ravil-mobile):

@sjw36, if I understood you correctly, we only need this change, right?

Contributor (sjw36):

This is really the only source change needed.

@sjw36 (Contributor) left a comment:

Let's just do the swap for now and fix up the tests.

```diff
@@ -58,7 +53,7 @@ findEarlyInsertionPoint(Block *block, Operation *move) {
   if (isa<triton::AtomicRMWOp, triton::AtomicCASOp>(wop))
     ipnt = bi;
   // Break at barrier
-  if (isa<gpu::BarrierOp>(wop))
+  if (isa<mlir::gpu::BarrierOp>(wop))
```
Contributor (sjw36):

This should be unnecessary since it built before.

Contributor Author (ravil-mobile):

This happened because of the include of "third_party/amd/lib/TritonAMDGPUToLLVM/Utility.h". That header in turn includes "triton/Conversion/TritonGPUToLLVM/Utility.h", which contains the following definition:

```cpp
namespace gpu {
Type getFunctionType(Type resultType, ValueRange operands);
LLVM::LLVMFuncOp appendOrGetExternFuncOp(RewriterBase &rewriter, Operation *op,
                                         StringRef funcName, Type funcType,
                                         StringRef libname = "",
                                         StringRef libpath = "");
} // namespace gpu
```

As you can see, it declares a `gpu` namespace, so we need to be explicit about the namespace in the ReorderInstructions.cpp file. Otherwise, we get a compilation error.
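A minimal, self-contained illustration of that name-lookup clash, using hypothetical stand-in types rather than the actual Triton headers: once a second `gpu` namespace becomes visible inside the enclosing namespace, it hides `mlir::gpu`, and only the fully qualified name resolves.

```cpp
// Sketch of the ambiguity, with stand-in types (not the real headers).
namespace mlir {
namespace gpu {
struct BarrierOp {}; // stands in for mlir::gpu::BarrierOp
} // namespace gpu

namespace triton {
namespace gpu { // second `gpu`, as pulled in via the Utility.h include
void someUtility();
} // namespace gpu

void findBarrier() {
  // Inside mlir::triton, the unqualified name `gpu` finds
  // mlir::triton::gpu first and hides mlir::gpu, so this fails to compile:
  //   gpu::BarrierOp b; // error: no BarrierOp in mlir::triton::gpu
  mlir::gpu::BarrierOp b; // OK: fully qualified
  (void)b;
}
} // namespace triton
} // namespace mlir

int main() { mlir::triton::findBarrier(); }
```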

```diff
@@ -214,14 +209,15 @@ static void moveUpTranspose(triton::FuncOp funcOp) {
 }

 // Schedule global load and local store ops for better GEMM performance.
-static void scheduleGlobalLoadLocalStore(triton::FuncOp funcOp) {
+static void scheduleGlobalLoadLocalStore(scf::ForOp forOp) {
```
Contributor (sjw36):

This may have implications for other workloads, where it is beneficial to apply outside of for loops. Let's keep it as before.

Contributor Author (ravil-mobile):

I remember we had a problem with the persistent_streamk kernel because there were two nested scf.ForOp ops. The inner one (which was a pure GEMM loop) was not captured by our pattern matcher; see the sketch below.
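For illustration only, a hedged sketch of how a nested loop can be missed, assuming the matcher inspects loops rather than recursing through the function: `funcOp.walk` visits nested scf.ForOp ops, while iterating only the top-level operations of the function body does not.

```cpp
// Illustrative sketch: funcOp.walk recurses into nested regions, so it
// also reaches an inner GEMM loop wrapped by an outer persistent loop.
funcOp.walk([&](scf::ForOp forOp) {
  scheduleGlobalLoadLocalStore(forOp); // visits the nested loop too
});
// A matcher that only inspects the top-level ops of the function body
// would see the outer scf.for and miss the pure-GEMM inner one.
```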

Contributor Author (ravil-mobile):

@sjw36, this function is only executed when we are dealing with pure matmul problems:

```cpp
if (isPureMatmulProblem(funcOp)) {
  scheduleGlobalLoadLocalStore(funcOp);
  sinkSecondLoad(funcOp);
}
```


@ravil-mobile changed the title from "[AMD] extended the ReorderInstructions pass to handle special cases" to "[AMD] fixed the ReorderInstructions pass" on Nov 26, 2024.
@ravil-mobile (Contributor Author) commented:
Hi @sjw36,

I did another revert. I will add a fix for persistent_streamk in a separate PR (that way it will be easy to revert if needed).

@sjw36 (Contributor) left a comment:

Looks good. Let's run perf to verify no regressions.

@zhanglx13 marked this pull request as ready for review on December 2, 2024 at 20:03.